Graph-Based N-gram Language Identification on Short Texts

Authors

  • Erik Tromp
  • Mykola Pechenizkiy
Abstract

Language identification (LI) is an important task in natural language processing. Several machine learning approaches have been proposed for this problem, but most of them assume relatively long, well-written texts. We propose a graph-based N-gram approach for LI, called LIGA, which targets relatively short and ill-written texts. The results of our experimental study show that LIGA outperforms the state-of-the-art N-gram approach for LI on Twitter messages.
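
The abstract does not spell out how the graph is built, so the following is only a minimal sketch of one way a graph over character trigrams could be used for language identification. It assumes that nodes are trigrams, that edges connect consecutive trigrams, and that both carry per-language counts normalized by each language's totals. The class name `TrigramGraph`, its methods, and the scoring rule are hypothetical illustrations, not the authors' implementation.

```python
from collections import defaultdict


def trigrams(text):
    """Return the overlapping character trigrams of text, in order."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


class TrigramGraph:
    """Hypothetical trigram graph with per-language node and edge counts."""

    def __init__(self):
        # nodes[trigram][lang] -> how often the trigram was seen in lang
        self.nodes = defaultdict(lambda: defaultdict(int))
        # edges[(a, b)][lang] -> how often trigram b directly followed a in lang
        self.edges = defaultdict(lambda: defaultdict(int))
        # Per-language totals, used to normalize the scores
        self.node_total = defaultdict(int)
        self.edge_total = defaultdict(int)

    def train(self, text, lang):
        """Add one labeled text to the graph."""
        grams = trigrams(text.lower())
        for g in grams:
            self.nodes[g][lang] += 1
            self.node_total[lang] += 1
        for pair in zip(grams, grams[1:]):
            self.edges[pair][lang] += 1
            self.edge_total[lang] += 1

    def classify(self, text):
        """Return the best-scoring language, or None if nothing matches."""
        grams = trigrams(text.lower())
        scores = defaultdict(float)
        for g in grams:
            if g in self.nodes:
                for lang, count in self.nodes[g].items():
                    scores[lang] += count / self.node_total[lang]
        for pair in zip(grams, grams[1:]):
            if pair in self.edges:
                for lang, count in self.edges[pair].items():
                    scores[lang] += count / self.edge_total[lang]
        return max(scores, key=scores.get) if scores else None


# Toy usage with two tiny training texts (illustrative data only)
graph = TrigramGraph()
graph.train("the quick brown fox jumps over the lazy dog", "en")
graph.train("de snelle bruine vos springt over de luie hond", "nl")
print(graph.classify("the dog"))
print(graph.classify("de hond"))
```

Scoring edges as well as nodes lets the order of trigrams contribute evidence, which is one plausible reason a graph could help on short texts where individual trigram counts are sparse.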

Similar resources

Implementation and Evaluation of a Language Identification System for Mono- and Multi-lingual Texts

Language identification is a classification task between a pre-defined model and a text in an unknown language. This paper presents the implementation of a tool for language identification for mono- and multilingual documents. The tool includes four algorithms for language identification. An evaluation for eight languages, including Ukrainian and Russian, and various text lengths is presented. It ...

Language Identification on the Web: Extending the Dictionary Method

Automated language identification of written text is a well-established research domain that has received considerable attention in the past. By now, efficient and effective algorithms based on character n-grams are in use, mainly with identification based on Markov models or on character n-gram profiles. In this paper we investigate the limitations of these approaches when applied to real-world...

Automatic identification of language varieties: The case of Portuguese

Automatic Language Identification of written texts is a well-established area of research in Computational Linguistics. State-of-the-art algorithms often rely on n-gram character models to identify the correct language of texts, with good results seen for European languages. In this paper we propose the use of a character n-gram model and a word n-gram language model for the automatic classifica...

Unsupervised Clustering for Language Identification

The current state of the art in language identification comes from n-gram language models. While these can reach 99% accuracy (Hammarstrom, 2007), they have three major shortcomings. First, n-gram language models are supervised. They require substantial labeled training data in each language in order to be functional. For best results, this training data should also be in the same genre as the ...

Entity Recognition and Language Identification with FELTS

These working notes describe the experiments we conducted in the Microblog Cultural Contextualization Lab [2] of CLEF 2017 [3]. The microblog data is composed of very short texts, with very heterogeneous styles. Some of them are written in more than one language. We decided to tackle the entity recognition problem by using a non-statistical, dictionary-based, multiword term extractor. On the othe...

Publication date: 2011